METHOD AND APPARATUS FOR PREDICATION USING MICRO-OPERATIONS

Background 

Technical Field 

[0001] The present disclosure relates generally to information processing systems and,
more specifically, to processors that utilize predication.

Background Art 

[0002] In modern processor designs, one method of increasing performance is executing 
multiple instructions per clock cycle. The performance of such processors is dependent on the 
amount of instruction level parallelism (ILP) exposed by the compiler and exploited by the 
microarchitecture. One approach for increasing the amount of ILP that is available for
compile-time instruction scheduling is predication. Predication is also useful for decreasing the number
of branch mispredictions. 

[0003] Predication is a method of converting control flow dependencies to data 
dependencies. A predicated execution model is an architectural model where an instruction is 
guarded by a Boolean operand whose value determines if the instruction is executed or nullified.
For example, the Explicitly Parallel Instruction Computing ("EPIC") architecture utilized by 
Itanium® and Itanium® 2 microprocessors features a set of 64 predicate registers to support 
conditional execution of instructions by providing the Boolean predication operand. 

[0004] To explore ILP, a compiler can take full advantage of predication by applying a 
technique referred to as if-conversion to convert control flow dependence into data flow
dependence. With if-conversion, the compiler can collapse multiple control flow paths and 
schedule them based only on data dependencies. 
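
By way of illustration only, the following Python sketch contrasts a branching code fragment with an if-converted equivalent. The function names, the particular arithmetic, and the use of Python conditional expressions to stand in for predicated instructions are assumptions made for this sketch and do not appear in the disclosure.

```python
def with_branch(cond, a, b):
    # Original control flow: only one of the two paths executes.
    if cond:
        x = a + b
    else:
        x = a - b
    return x


def if_converted(cond, a, b):
    # Hypothetical if-converted form: the branch is replaced by two
    # complementary predicates that guard straight-line instructions.
    p_true = bool(cond)
    p_false = not p_true
    x = None
    # Each "instruction" is either executed or nullified by its predicate.
    x = a + b if p_true else x
    x = a - b if p_false else x
    return x


results = [(with_branch(c, 7, 3), if_converted(c, 7, 3)) for c in (True, False)]
```

Both forms yield the same value for either outcome of the condition; the if-converted form simply has no control flow dependence left to schedule around.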

[0005] The Itanium® and Itanium® 2 microprocessors, for instance, support predication
and issue instructions in program order. However, predication may also provide performance
benefits for processors that allow instructions to issue out-of-order. An out-of-order execution
model is, in general, more complex than a static execution model. Static execution executes
code in the order scheduled statically by the compiler, while out-of-order execution permits the
processor to dynamically adjust instruction scheduling to the run-time behavior of the program.

[0006] Because of this ability to adapt to the run-time environment, dynamic out-of-order
execution has been employed in many processor designs. For processors that allow instructions
to issue out of order, register renaming is used to increase the number of instructions that a
superscalar processor can issue in parallel. Renaming each independent definition of an
architectural register to different physical registers in a physical register file improves parallelism
by preventing dependence-induced delays. Renaming removes WAR (write-after-read) and
WAW (write-after-write) dependencies and allows multiple independent instructions that write
to the same architectural register to issue concurrently.
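
The renaming idea described above may be sketched, purely as an illustration, in a few lines of Python. The `rename` function, the `P0`, `P1`, ... physical register names, and the three-instruction sample program are hypothetical and are not taken from the disclosure.

```python
from itertools import count


def rename(instructions):
    """Rename (dest, src1, src2) tuples of architectural register names."""
    fresh = count()      # hypothetical supply of free physical registers
    map_table = {}       # architectural register -> current physical register
    renamed = []
    for dest, *srcs in instructions:
        # Sources read the most recent mapping; a register that has not yet
        # been written in this fragment keeps its architectural name.
        phys_srcs = [map_table.get(s, s) for s in srcs]
        # Every new definition receives a fresh physical register, which is
        # what removes WAR and WAW dependencies on the previous definition.
        map_table[dest] = f"P{next(fresh)}"
        renamed.append((map_table[dest], *phys_srcs))
    return renamed


# Two independent writes to r1 with one reader of the first write between them.
program = [("r1", "r3", "r4"), ("r5", "r1", "r2"), ("r1", "r2", "r3")]
result = rename(program)
```

In this sketch the two writes to r1 land in different physical registers, so the third instruction need not wait for the second instruction to read the earlier value of r1.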

[0007] Efficiency of the renaming mechanism in an out-of-order processor may drive
processor performance. That is, the renaming mechanism and its associated physical register file
may represent critical resources for an out-of-order processor architecture. Implementation of
predication on such processors poses interesting issues.

Brief Description of the Drawings

[0008] The present invention may be understood with reference to the following drawings 
in which like elements are indicated by like numbers. These drawings are not intended to be 
limiting but are instead provided to illustrate selected embodiments of a method, apparatus and 
system for implementing predicated instructions using micro-operations.

[0009] Fig. 1 is a block diagram of at least one embodiment of a processing system capable 
of utilizing disclosed techniques. 

[00010] Fig. 2 is a block diagram illustrating micro-architectural features of at least one 
embodiment of a processor. 

[00011] Fig. 3 is a flow diagram illustrating at least one embodiment of a generalized
execution pipeline for an out-of-order processor. 

[00012] Fig. 4 is a block diagram illustrating at least one embodiment of hardware to select
between old and new values for a register during execution of a multiple-destination predicated 
instruction. 

[00013] Fig. 5 is a block diagram illustrating at least one embodiment of three
one-destination micro-ops to implement a sample multiple-destination predicated instruction.

[00014] Fig. 6 is a flowchart illustrating at least one method for generating single-destination 
micro-ops for a predicated multi-destination instruction. 

[00015] Fig. 7 is a flowchart illustrating further detail for at least one embodiment of a
method for generating single-destination micro-ops for a predicated multi-destination instruction.

[00016] Fig. 8 is a block diagram illustrating decomposition of an "or" form parallel compare
instruction.

[00017] Fig. 9 is a block diagram illustrating a first selection circuit for execution of a
parallel compare instruction and illustrating a simplified selection circuit for execution of a
decomposed conditional move micro-operation.

[00018] Fig. 10 is a block diagram illustrating decomposed micro-operations and selection
circuit for an "or.andcm" form of parallel compare instruction.

[00019] Fig. 11 is a block diagram illustrating decomposed micro-operations and selection
circuit for an "and.orcm" form of parallel compare instruction.

Detailed Description 

[00020] Described herein are selected embodiments of a system, apparatus and methods for
implementing predicated instructions using micro-operations. In the following description,
numerous specific details such as processor types, pipeline stages, instruction formats, renaming
mechanisms, and control flow ordering have been set forth to provide a more thorough
understanding of the present invention. It will be appreciated, however, by one skilled in the art
that the invention may be practiced without such specific details. Additionally, some
well-known structures, circuits, and the like have not been shown in detail to avoid unnecessarily
obscuring the present invention.

[00021] Fig. 1 is a block diagram illustrating at least one embodiment of a processing system
100 capable of decomposing predicated and non-predicated instructions into multiple
micro-operations called "micro-ops" or "μ-ops." The instructions to be decomposed may include both
arithmetic and memory instructions. Fig. 1 illustrates that processing system 100 includes a
memory 150 in which the instructions may be stored. The processing system 100 also includes a
processor 101 to perform out-of-order execution of predicated and non-predicated instructions.
The processor 101 utilizes a pipeline 300 (see Fig. 3), which includes multiple dynamic pipeline
stages.

[00022] More specifically, at least one embodiment of processor 101 includes a micro-code
engine, referred to herein as μ-op generator 116, to decompose instructions into a series of
micro-operations. For each instruction 120, the μ-op generator 116 generates a series of
micro-operations 172a-172n. The micro-operations 172a-172n include "standard" micro-ops that are
generated regardless of whether the instruction 120 is predicated. If the instruction 120 is
predicated, the micro-operations 172a-172n include additional predication-support micro-ops.

[00023] The resultant micro-operations 175a-175n, including the standard micro-ops and
any additional predication support micro-ops, each specify a single destination register, even if
the original instruction 120 indicates multiple destination registers. It is the resultant
micro-operations 175a-175n that are executed, rather than the predicated instruction 120, by execution
units 175. Of course, ellipses in Fig. 1 indicate that additional operations (such as, for example,
register renaming) may be performed on the resultant micro-operations 175a-175n before
execution by the execution units 175.

[00024] Also, one of skill in the art will recognize that additional operations may be
performed on the instruction 120 before the μ-op generator 116 receives the instruction 120. For
example, the instruction 120 may be fetched from an instruction cache 160 and may also be
decoded before being forwarded to the μ-op generator 116. In addition, responsive to a cache
miss in the instruction cache 160, the instruction 120 may be retrieved from an instruction space
140 in the memory 150.

[00025] Fig. 2 illustrates additional micro-architectural details for at least one embodiment
of a processor 101a. Processor 101a includes physical registers 104 and a rename map table 102.
The processor 101 also includes an out-of-order ("OOO") register rename unit 106. The OOO
rename unit 106, map table 102 and physical registers 104 are all utilized for the purpose of
OOO register renaming.

[00026] Out-of-order rename unit 106 performs renaming by mapping an architectural
register to a physical register 104 in order to dynamically increase ILP in the instruction stream.
That is, for each occurrence of an architectural register in an instruction in the instruction stream
of the processor 101a, out-of-order rename unit 106 may map such occurrence to a physical
register. As used herein, the term "instruction" is intended to encompass any type of instruction
that can be understood and executed by functional units 175, including macro-instructions and
micro-operations. One of skill in the art will recognize that renaming may be performed
concurrently for multiple threads.

[00027] During out-of-order renaming for architectural registers, at least one embodiment of 
the out-of-order rename unit 106 enters data into the map table 102. The map table 102 is a 
storage structure to hold one or more rename entries. In practice, the actual entries of the map 
table 102 form a translation table that keeps track of mapping of architectural registers, which are
defined in the instruction set, to physical registers 104. The physical registers 104 may maintain 
intermediate and architected data state for the architectural register being renamed. 

[00028] Accordingly, it has been described that the map table 102 and physical registers 104 
facilitate renaming, by renamer unit 106, of architectural registers defined in an instruction set. 
The renaming may occur during an out-of-order rename pipeline stage 308 (see Fig. 3).

[00029] Fig. 2 illustrates that an embodiment 101a of a processor may also include an
architectural renamer 118 to provide renaming that supports register windowing and register
rotation. For such embodiment, a portion of the registers in the processor 101 is utilized to
implement a register stack to provide fresh registers, called a register frame (also referred to as a
"window"), when a procedure is called with an allocation instruction. For at least one
embodiment, the first 32 registers of a 128-register register file are static, and the remaining 96
registers implement a register stack to provide fresh registers when an allocation instruction is
executed (which typically occurs after a call instruction). One commonly-recognized benefit of
register windowing is reduced call/return overhead associated with saving and restoring register
values that are defined before a subroutine call, and used after the subroutine call returns.

[00030] Fig. 2 illustrates that the processor 101a may also include a register stack engine
(RSE) 122. The RSE 122 works in conjunction with the architectural renamer 118 to save and
restore the contents of physical registers to memory when needed.

[00031] Fig. 2 further illustrates that the processor 101a may also include a micro-op queue
173. The micro-op queue 173 holds micro-ops 172a-172n that have been generated by the μ-op
generator 116 until such micro-ops are forwarded to the OOO rename unit 106.

[00032] Brief reference to Fig. 3 illustrates an embodiment of the execution pipeline 300
mentioned above. Fig. 3 illustrates that generating micro-ops during decomposition of an
instruction may occur during a μ-op generation phase 310 of an execution pipeline 300. The
illustrative pipeline 300 illustrated in Fig. 3 includes the following stages: instruction pointer
generation 302, instruction fetch 304, instruction decode 306, architectural register rename 308,
μ-op generation 310, physical register rename 311, dispatch 312, execution 313, and instruction
retirement 314. The pipeline 300 illustrated in Fig. 3 is illustrative only; the techniques
described herein may be used on any processor. For an embodiment in which the processor
utilizes an execution pipeline 300, the stages of a pipeline 300 may appear in different order than
that depicted in Fig. 3.

[00033] The techniques disclosed herein may be utilized on a processor whose pipeline 300
may include different or additional pipeline stages to those illustrated in Fig. 3. For example,
alternative embodiments of the pipeline 300 may include additional pipeline stages for rotation,
expansion, exception detection, etc. In addition, a VLIW-type (very long instruction word)
processor may include different pipeline stages, such as a word-line decode stage, than appear in
the pipeline for a processor that includes variable-length instructions in its instruction set.

[00034] Leaving Fig. 3, we turn now to a further discussion of predication-support
micro-ops. One of the interesting issues to be dealt with when performing predicated instructions in an
out-of-order processor arises from the fact that predicated instructions may indicate more than
two source operands. Assume, for example, that a processor provides 64 predicate registers and
that "(qp)" indicates the predicate register that provides the qualifying predicate value for a
given instruction. Consider the following example code sequence generated by a compiler:

a. add r1 = r3, r4

b. (qp) add r1 = r2, r3

The first instruction, instruction a, is an unconditional instruction that generates a result value in
destination register r1. The result value generated for register r1 by instruction a is referred to
herein as the "first" value of r1. Keep in mind that, for a processor that utilizes OOO renaming
in order to dynamically increase ILP, the "first" value of r1 may reside in a physical rename
register as indicated by the rename map table 102 (Fig. 2).

[00035] The second instruction, instruction b, is a predicated instruction indicating a single
destination register, r1. If the predicate value in qp is true, the add operation indicated by
instruction b is to be performed. In such case, a "second" result value of r1 is generated.
(Again, the value may actually reside in a physical rename register such as one of the physical
registers 104 illustrated in Fig. 2). However, if the predicate value in qp is false, then the add
operation indicated by instruction b will not be performed. In such case, the value of r1 after
execution of statement b, when qp is false, is the "first" value of r1. Essentially, then, instruction
b is a select operation.
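
The select behavior of the predicated add may be expressed, as a minimal sketch only, by the following Python function. The function and parameter names are hypothetical; `first_r1` stands for the value of the destination register before the predicated add executes.

```python
def predicated_add(qp, first_r1, r2, r3):
    """Model the value of the destination register after a predicated add."""
    second_r1 = r2 + r3          # result the add would produce
    # The qualifying predicate selects between the new and the old value.
    return second_r1 if qp else first_r1


taken = predicated_add(True, 7, 2, 3)       # predicate true: add takes effect
nullified = predicated_add(False, 7, 2, 3)  # predicate false: old value kept
```

Note that the function needs four inputs (the predicate, the old destination value, and the two add sources), which is the source-operand count at issue in the discussion of Fig. 4.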

[00036] Fig. 4 illustrates that sample instruction b effectively takes four sources to determine
the value of r1: the "first" value of r1, qp, r2 and r3. The determination of the value of r1 is
basically a mux function, with the predicate value being the selection input for the mux 400. Fig.
4 illustrates that the hardware to implement such a selection function is relatively complex,
requiring four inputs and chaining together of all producers for the potential r1 value. In
addition, having more than two source operands for the instruction also leads to increased
complexity for the OOO rename unit 106 (Figs. 1, 2) and the register file.

[00037] Fig. 5 illustrates that, in order to simplify the hardware that executes a predicated
instruction having more than two source operands, a sequence of three micro-operations 502,
504, 506 may be generated for instruction b (500). The following sequence of micro-operations
may be produced by the μ-op generator 116 (Figs. 1, 2) for sample instruction b set forth above.
That is, predicated instruction b may be decomposed by μ-op generator 116 into the following
multiple micro-ops, where each micro-op has only two sources and only one destination:

502. append temp = r1, qp

504. add r1 = r2, r3

506. cmov r1 = temp, r1

[00038] Micro-op 502, the append operation, is a predication-support micro-op that copies
the "first" value of r1, plus the predicate value in qp, into a non-architected internal temporary
rename register (such as one of the physical rename registers 104 illustrated in Fig. 2). Note that
this approach results, for at least one embodiment, in the addition of an extra bit to the data
associated with the temp register. For at least one embodiment, the integer register file thus
accommodates 64 bits of integer data in addition to a 65th bit for the predicate bit associated with
the temp register. (For at least one embodiment, the register file may also include one or more
additional bits, such as a Not-a-Thing bit to support deferred exception processing, associated with
the temp register).

[00039] Micro-op 504, the add operation, is essentially a normal non-predicated add
micro-op. It indicates only one destination register and, at most, two source registers. This type of
instruction is referred to herein as a "standard" micro-op. A standard micro-op addresses the
basic operation indicated by the instruction 120 (such as instructions 1-5 set forth in Table 1),
without consideration of predication.

[00040] Micro-op 506, the conditional move operation, is a predication-support micro-op
that effects a selection between "first" and "second" values of r1. The conditional move
("cmov") micro-op 506 uses the extra bit appended to temp, which represents the qp value for
the original predicated instruction 500, to select between the data value of temp (the original, or
"first," value for r1) and the new, or "second," value of r1 computed by the add instruction 504.
If the appended bit is true, then the predicate register for the original instruction 500 indicated a
"true" value. In such case, the new "second" value of r1 is copied over itself into r1. If,
however, the appended bit is false, then the predicate register for the original instruction 500
indicated a "false" value. In such case, the original "first" value of r1 is moved from temp back
into r1.

[00041] Fig. 5 illustrates that the hardware 520 to execute the cmov instruction 506 is much
simpler than the hardware 400 for execution of instruction b (500) that is illustrated in Fig. 4.
After decomposition, simplified execution hardware 520 for the cmov instruction 506 only
receives two inputs: temp and r1.
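
As a hedged illustration of the three-micro-op decomposition, the following Python sketch models the append, add, and cmov micro-ops on a small register dictionary. The `Temp` tuple, the helper names, and the dictionary-based register file are assumptions made for this sketch; the extra predicate bit carried with the temporary register is modeled as the `pred` field.

```python
from typing import NamedTuple


class Temp(NamedTuple):
    """Temporary register: a data value plus one appended predicate bit."""
    value: int
    pred: bool


def append(value, qp):
    # Micro-op 502: save the "first" value together with the predicate bit.
    return Temp(value, bool(qp))


def cmov(temp, current):
    # Micro-op 506: two inputs only; the appended bit selects the result.
    return current if temp.pred else temp.value


def run_decomposed_add(regs, qp):
    regs = dict(regs)                        # leave the caller's state intact
    temp = append(regs["r1"], qp)            # 502. append temp = r1, qp
    regs["r1"] = regs["r2"] + regs["r3"]     # 504. add r1 = r2, r3
    regs["r1"] = cmov(temp, regs["r1"])      # 506. cmov r1 = temp, r1
    return regs


start = {"r1": 7, "r2": 2, "r3": 3}
when_true = run_decomposed_add(start, True)["r1"]
when_false = run_decomposed_add(start, False)["r1"]
```

Each modeled micro-op reads at most two sources and writes one destination, yet the sequence as a whole reproduces the four-input select behavior of the original predicated add.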

[00042] For an embodiment of processor 101a capable of executing predicated instructions
having two destinations, the process of decomposing such predicated instructions into multiple
micro-operations may be even more complicated than the example illustrated in Fig. 5. For
instance, extra predication-support append and cmov micro-ops may be generated for each
destination register indicated by a multiple-destination instruction. For purposes of illustration,
Table 1 sets forth micro-ops that may be generated for the example illustrated in Fig. 5 and also
for multiple-destination predicated instructions. For at least one embodiment, μ-op generator
116 (Fig. 1) generates the indicated micro-ops during a micro-op generation stage 310 of an
execution pipeline 300 (Fig. 3). One of skill in the art will recognize that the instruction forms
set forth in Table 1 are illustrative only, and are not intended to be an exhaustive list. Nor is it
intended that the sample syntax for instructions and micro-ops set forth in Table 1 be taken to be
limiting.

Table 1

Instructions                              Micro-op Sequences

1. (qp) add r1 = r2, r3                   1a  append temp = r1, qp
                                          1b  add r1 = r2, r3
                                          1c  cmov r1 = temp, r1

2. (qp) ld4 r1 = [r3], r2                 2a  append temp1 = r1, qp
                                          2b  append temp2 = r3, qp
                                          2c  ld4 r1 = [r3]
                                          2d  add r3 = r3, r2
                                          2e  cmov r1 = temp1, r1
                                          2f  cmov r3 = temp2, r3

3. (qp) cmp.eq p1, p2 = r2, r3            3a  append p-temp1 = p1, qp
                                          3b  append p-temp2 = p2, qp
                                          3c  cmp.eq p1 = r2, r3
                                          3d  cmp.ne p2 = r2, r3
                                          3e  cmov p1 = p-temp1, p1
                                          3f  cmov p2 = p-temp2, p2

4. (qp) cmp.eq.unc p1, p2 = r2, r3        4a  cmp.eq p1 = r2, r3
                                          4b  cmp.ne p2 = r2, r3
                                          4c  cmov.unc p1 = qp, p1
                                          4d  cmov.unc p2 = qp, p2

5. (qp) cmp.eq.or p1, p2 = r2, r3         5a  append p-temp1 = p1, qp
   (See Tables 2-11 for more              5b  append p-temp2 = p2, qp
   information concerning                 5c  cmp.eq p1 = r2, r3
   parallel compare instructions)         5d  cmp.eq p2 = r2, r3
                                          5e  cmov.or p1 = p-temp1, p1
                                          5f  cmov.or p2 = p-temp2, p2


[00043] The first instruction listed in Table 1 is the predicated one-destination add example
illustrated in Fig. 5. The second instruction listed in Table 1 is, for at least one embodiment, a
load instruction of the "register base update" form supported by the instruction set architecture of
the Itanium® and Itanium® 2 microprocessors. To execute the instruction, a four-byte value is
read from memory starting at the address specified by the value in r3. The value, which may be
zero-extended, is placed in r1. The value in r3 is added to the value in r2, and the result is then
placed back in r3. In this manner, the base address used for the load is incremented by the value
in r2 after the load operation has been performed.

042390.P17014
Express Mail No.: EV325527992US

[00044] Table 1 illustrates that, for at least one embodiment, the load register base update
form of instruction is decomposed (by, for instance, a μ-op generator such as 116 illustrated in
Figs. 1 and 2) into six micro-operations, 2a-2f. In order to simplify instruction 2 into standard
micro-ops that each include only one destination and two sources, a standard non-predicated load
instruction (2c) is generated and a separate standard add instruction to increment the base address
by the value in r2 is also generated (see 2d). These standard micro-ops, 2c and 2d, are generated
regardless of whether the load register base update instruction is predicated.

[00045] In addition, because instruction 2 is predicated, Table 1 indicates that
predication-support micro-ops 2a, 2b, 2e and 2f are generated. One append micro-op and one cmov
micro-op are generated for each predicated destination operand (r1 and r3). Thus, predication-support
append micro-op 2a and cmov micro-op 2e are generated for destination register r1.
Predication-support append micro-op 2b and cmov micro-op 2f are generated for destination register r3.
Each of the six decomposed micro-ops 2a-2f indicates a single destination register and two
source operands.

[00046] Table 1 illustrates that micro-op 2e is a predication-support cmov micro-op that
conditionally overwrites register r1 with the "first" value of r1 that was saved into temp1 as a
result of micro-op 2a. That is, if the appended predicate bit value in temp1 indicates a "false"
value, then the predicate for the original instruction (instruction 2) was a "false" value. In such
case, the value of r1 computed as a result of micro-op 2c is overwritten with the "first" value of
r1 saved in temp1 as a result of micro-op 2a. Similarly, if the appended predicate bit value in
temp1 indicates a "true" value, then the value of r1 computed as a result of micro-op 2c is the
desired value for r1. In such case, the value of r1 computed as a result of micro-op 2c is copied
to itself as a result of micro-op 2e.

[00047] Similarly, micro-op 2f is a conditional move instruction for the second destination
register, register r3. Micro-op 2f conditionally overwrites register r3 with the "first" value of r3
that was saved into temp2 as a result of micro-op 2b. If the appended predicate bit value in temp2
indicates a "false" value, then the predicate for the original instruction (instruction 2) was a "false" value.
In such case, the value of r3 computed as a result of micro-op 2d is overwritten with the "first"
value of r3 saved in temp2 as a result of micro-op 2b. Similarly, if the appended predicate bit
value in temp2 indicates a "true" value, then the value of r3 computed as a result of micro-op 2d
is the desired value for r3. In such case, the value of r3 computed as a result of micro-op 2d is
copied to itself as a result of micro-op 2f.
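
The six micro-ops 2a-2f may be modeled, as a sketch under stated assumptions, by the Python function below. Memory is modeled as a dictionary from address to value, the four-byte zero-extended load is approximated by masking to 32 bits, and an unmapped address reads as zero; these modeling choices and all names are hypothetical. The sketch also performs the load even when the predicate is false, which is a simplification that ignores any fault or cache behavior.

```python
def run_ld4_base_update(regs, memory, qp):
    """Model micro-ops 2a-2f for a predicated ld4 with base update."""
    regs = dict(regs)
    temp1 = (regs["r1"], bool(qp))                         # 2a append temp1 = r1, qp
    temp2 = (regs["r3"], bool(qp))                         # 2b append temp2 = r3, qp
    regs["r1"] = memory.get(regs["r3"], 0) & 0xFFFFFFFF    # 2c ld4 r1 = [r3]
    regs["r3"] = regs["r3"] + regs["r2"]                   # 2d add r3 = r3, r2
    regs["r1"] = regs["r1"] if temp1[1] else temp1[0]      # 2e cmov r1 = temp1, r1
    regs["r3"] = regs["r3"] if temp2[1] else temp2[0]      # 2f cmov r3 = temp2, r3
    return regs


memory = {100: 0xDEADBEEF}
start = {"r1": 9, "r2": 4, "r3": 100}
loaded = run_ld4_base_update(start, memory, True)
skipped = run_ld4_base_update(start, memory, False)
```

With a true predicate both destinations take their new values; with a false predicate both cmov steps restore the saved "first" values, so the register state is unchanged.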

[00048] One of skill in the art will recognize that the series of micro-ops 2a-2f set forth in
Table 1 is only one suggested embodiment and should not be taken to be limiting. For example,
an alternative series of micro-ops that may be generated for instruction 2 during decomposition
of instruction 2 is set forth in Table 1a:



Table 1a - Alternative Load Micro-Op Sequence

Instruction                               Micro-op Sequence

1. (qp) ld r1 = [r3]                      6a  append temp3 = r3, qp
                                          6b  ld temp1 = [temp3]
                                          6c  cmov r1 = temp1, r1

[00049] Table 1a illustrates an alternative embodiment wherein the qualifying predicate is
appended to the address used for the load operation rather than being appended to the incoming
value of one of the destination registers indicated by the instruction. In other words, the append
micro-op 6a indicates that the qp value is to be stored as an additional bit with the load address.

[00050] Similarly, Table 1b illustrates that similar alternative append processing may be
employed during decomposition of store instructions:

Table 1b - Alternative Store Micro-Op Sequence

Instruction                               Micro-op Sequence

1. (qp) store [r3] = r2                   7a  append temp3 = r3, qp
                                          7b  store [temp3] = r2


The store micro-operation 7b is conditional. Operation of a processor executing the store
micro-op 7b depends on the qp value. For at least one embodiment, the qp value appended to the store
address determines whether the processor will perform store-to-load forwarding. Similarly,
operation of a processor executing the store micro-op 7b, as well as the load micro-op 6b
illustrated in Table 1a, depends on the qp value to determine whether the processor should
perform certain operations, such as cache miss and translation lookaside buffer miss processing.
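
The alternative store sequence 7a-7b may be sketched as follows, for illustration only. The sketch assumes that a false predicate bit carried with the store address suppresses the memory write; store-to-load forwarding, cache miss handling and translation lookaside buffer behavior are not modeled. The function name and the dictionary-based memory are hypothetical.

```python
def run_predicated_store(regs, memory, qp):
    """Model micro-ops 7a-7b: the predicate travels with the store address."""
    memory = dict(memory)                 # leave the caller's memory intact
    temp3 = (regs["r3"], bool(qp))        # 7a append temp3 = r3, qp
    address, pred = temp3
    if pred:                              # 7b store [temp3] = r2 (conditional)
        memory[address] = regs["r2"]
    return memory


regs = {"r2": 42, "r3": 100}
written = run_predicated_store(regs, {100: 0}, True)
untouched = run_predicated_store(regs, {100: 0}, False)
```

Because a store has no destination register to restore, no cmov micro-op is needed in this sequence; the predicate bit alone decides whether the store takes effect.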

[00051] The third predicated instruction indicated in Table 1 is a predicated "compare
equal" instruction to determine whether the two register values are equal and to set multiple
predicate register values accordingly. The instruction indicates two destinations, which are both
predicate registers. If the two values in registers r2 and r3 are equal, then, during execution, the
first destination (p1) is set to a "true" value while the second destination (p2) is set to a "false"
value. Converse result values are generated during execution if the two register values (r2 and
r3) are not equal to each other. One of skill in the art will recognize that instruction 3 is
presented for illustrative purposes only and should not be taken to be limiting. For example, any
other type of compare instruction is included within the scope of the present disclosure,
including at least the following types of compare instructions: equal, not equal, greater than, less
than, greater than or equal, less than or equal, etc.

[00052] Table 1 illustrates that the compare equal instruction is decomposed (by, for
instance, a μ-op generator such as 116 illustrated in Figs. 1 and 2) into six micro-operations,
3a-3f. A standard compare micro-op 3c is generated to set p1 to a "true" value if the values in r2
and r3 are equal. If the two values are not equal, execution of standard micro-op 3c will set the
value of p1 to "false."

[00053] A second standard compare micro-op (3d) is also generated to set p2 to a "true"
value if the values in r2 and r3 are not equal. If the two values are equal, execution of micro-op
3d will set the value of p2 to "false."

[00054] In addition, because instruction 3 is predicated, Table 1 indicates that
predication-support micro-ops 3a, 3b, 3e and 3f are generated. One predication-support append instruction
and one predication-support cmov instruction are generated for each predicated destination
operand, p1 and p2. (The append micro-ops 3a and 3b that append a predicate value onto a
predicate register effectively involve tracking a second bit along with the one-bit predicate value.
The OOO renamer 106 supports this extra bit).

[00055] Thus, append instruction 3a and cmov instruction 3e are generated for destination
predicate register p1 while append instruction 3b and cmov instruction 3f are generated for
destination predicate register p2. Each of the six decomposed micro-ops 3a-3f indicates a single
destination register and two source operands.

[00056] Table 1 illustrates that micro-op 3e is a predication-support cmov micro-op that
conditionally overwrites predicate register p1 with the "first" value of p1 that was saved into
p-temp1 as a result of micro-op 3a. That is, if the appended predicate bit value in p-temp1
indicates a "false" value, then the predicate for the original instruction (instruction 3) was a
"false" value. In such case, the value of p1 computed as a result of micro-op 3c is overwritten
with the "first" value of p1 saved in p-temp1 as a result of micro-op 3a. Similarly, if the
appended predicate bit value in p-temp1 indicates a "true" value, then the value of p1 computed
as a result of micro-op 3c is the desired value for p1. In such case, the value of p1 computed as a
result of micro-op 3c is copied to itself as a result of micro-op 3e.

[00057] Similarly, micro-op 3f is a predication-support cmov micro-op for the second
destination register, predicate register p2. Micro-op 3f conditionally overwrites predicate register p2 with the
"first" value of p2 that was saved into p-temp2 as a result of micro-op 3b. If the appended
predicate bit value in p-temp2 indicates a "false" value, then the predicate for the original
instruction (instruction 3) was a "false" value. In such case, the value of p2 computed as a result
of micro-op 3d is overwritten with the "first" value of p2 saved in p-temp2 as a result of
micro-op 3b. Similarly, if the appended predicate bit value in p-temp2 indicates a "true" value, then the
value of p2 computed as a result of micro-op 3d is the desired value for p2. In such case, the
value of p2 computed as a result of micro-op 3d is copied to itself as a result of micro-op 3f.
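
The compare decomposition 3a-3f may be illustrated by the following Python sketch. Predicate registers are modeled as Boolean entries in a dictionary, and each temporary is modeled as a pair holding the saved predicate value and the appended qp bit; these representations and the function name are assumptions made for the sketch.

```python
def run_cmp_eq(preds, regs, qp):
    """Model micro-ops 3a-3f for a predicated two-destination compare."""
    preds = dict(preds)
    p_temp1 = (preds["p1"], bool(qp))                        # 3a append p-temp1 = p1, qp
    p_temp2 = (preds["p2"], bool(qp))                        # 3b append p-temp2 = p2, qp
    preds["p1"] = regs["r2"] == regs["r3"]                   # 3c cmp.eq p1 = r2, r3
    preds["p2"] = regs["r2"] != regs["r3"]                   # 3d cmp.ne p2 = r2, r3
    preds["p1"] = preds["p1"] if p_temp1[1] else p_temp1[0]  # 3e cmov p1 = p-temp1, p1
    preds["p2"] = preds["p2"] if p_temp2[1] else p_temp2[0]  # 3f cmov p2 = p-temp2, p2
    return preds


old = {"p1": True, "p2": True}
equal_regs = {"r2": 5, "r3": 5}
executed = run_cmp_eq(old, equal_regs, True)
nullified = run_cmp_eq(old, equal_regs, False)
```

The saved value and the appended bit together form the two-bit quantity that the append micro-ops are described as tracking for each predicate destination.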

[00058] The fourth predicated instruction in Table 1 is a predicated "unconditional compare
instruction". As a result of executing an unconditional compare instruction, if the qp value for
the instruction is "false", the values of the destination predicate registers are set to zero, rather
than leaving the values unchanged. The instruction implements the following pseudo-code
function for the first specified target predicate register: if (qp=true), then {destination register =
compare result} else {destination register = 0}. The instruction implements the following
pseudo-code function for the second specified target predicate register: if (qp=true), then
{destination register = (not)compare result} else {destination register = 0}.

[00059] Upon a false qualifying predicate, the destination registers are set to zero rather than
to the "first" values of the registers. For such operation, there is no need to save the "first"
values of the predicate registers. Accordingly, Table 1 illustrates that predication-support 
append micro-ops for the destination registers are not generated for the unconditional compare 
instruction. 

[00060] If the qp value for the unconditional compare instruction is "true", then the
instruction operates as a normal compare instruction. The compare result is written to the first
destination predicate register (p1 in the Table 1 example) while the complement of the compare
result is written to the second destination predicate register (p2 in the Table 1 example).

[00061] Because predication-support append micro-ops are not generated for the destination
registers of an unconditional compare instruction such as sample instruction 4 set forth in Table
1, a specialized form of the cmov micro-op is generated for the instruction (see micro-ops 4c and
4d in Table 1). Ordinarily, the cmov micro-ops 1c, 2e, 2f, 3e and 3f discussed above utilize an
extra bit of the temporary variable to select the appropriate destination register output value from
among two registers. Instead, the specialized predication-support cmov.unc micro-ops 4c and 4d
generated for the unconditional compare instruction write a zero value to the destination register
if the qp value is false.

[00062] Accordingly, Table 1 illustrates that the unconditional compare instruction is
decomposed (by, for instance, a μ-op generator such as 116 illustrated in Figs. 1 and 2) into only
four micro-operations, 4a-4d. Each of the four decomposed micro-ops 4a-4d indicates a single
destination register and two source operands.

[00063] Standard micro-ops are generated for the basic operation of the unconditional
compare instruction so that they each include only one destination and two sources. A standard
compare micro-op (4a) is generated to set p1 to a "true" value if the values in r2 and r3 are equal.
If the two values are not equal, execution of micro-op 4a will set the value of p1 to "false."

[00064] A second standard compare micro-op (4b) is also generated to set p2 to a "true" 
value if the values in r2 and r3 are not equal. If the two values are equal, execution of micro-op 
4b will set the value of p2 to "false." 

[00065] Table 1 illustrates that a specialized predication-support conditional move micro-op,
cmov.unc, is generated for each predicated destination operand, p1 and p2. Execution of the
cmov.unc micro-ops, 4c and 4d, results in writing a data value of zero to the corresponding
destination predicate register if the value of the qp operand is "false." Otherwise, if the value of
the qp operand is "true," the value of the predicate register generated during execution of the
unpredicated compare instruction (4a or 4b) is copied to itself.

[00066] For at least one embodiment, the μ-op generator 116 may utilize the cmov.unc
micro-op illustrated at entries 4c and 4d of Table 1 in other unconditional update sequences. For
example, the cmov.unc micro-op may be utilized in micro-op sequences generated for floating
point reciprocal approximation and/or floating point reciprocal square root approximation.

[00067] Table 1 illustrates that predication-support micro-op 4c is a cmov.unc micro-op that
conditionally overwrites predicate register p1 with a zero data value if the qp predicate bit value
for the instruction (instruction 4) indicates a "false" value. If, on the other hand, the qp predicate
bit value for instruction 4 indicates a "true" value, then the value of p1 that was computed as a
result of micro-op 4a is the desired value for p1 as a result of execution of instruction 4. In such
case, the value of p1 computed as a result of micro-op 4a is copied to itself as a result of
execution of micro-op 4c.

[00068] Similarly, micro-op 4d is also a predication-support cmov.unc micro-op for the
second destination register, predicate register p2. Micro-op 4d conditionally overwrites
predicate register p2 with a zero data value if the qp predicate bit value for the instruction
(instruction 4) indicates a "false" value. If, on the other hand, the qp predicate bit value for
instruction 4 indicates a "true" value, then the value of p2 that was computed as a result of
micro-op 4b is the desired value for p2 as a result of execution of instruction 4. In such case, the
value of p2 computed as a result of micro-op 4b is copied to itself as a result of execution of
micro-op 4d.
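
The four-micro-op decomposition of the unconditional compare can be modeled in a few lines of Python. This is a sketch under stated assumptions rather than the disclosed implementation: the function names `cmov_unc` and `unconditional_compare`, and the choice to pass qp directly as an argument of `cmov_unc`, are hypothetical. Because a false qp forces zero, the incoming values of p1 and p2 are never read, which mirrors the absence of append micro-ops for this instruction.

```python
def cmov_unc(qp: bool, computed: bool) -> bool:
    """Model of cmov.unc: keep the computed value if qp is true, else write zero.

    Passing qp directly as an operand is an assumption for illustration.
    """
    return computed if qp else False


def unconditional_compare(qp: bool, r2: int, r3: int) -> tuple[bool, bool]:
    """Model of micro-ops 4a-4d for a predicated unconditional compare."""
    p1 = (r2 == r3)         # 4a: standard compare, true if r2 and r3 are equal
    p2 = (r2 != r3)         # 4b: standard compare, true if r2 and r3 differ
    p1 = cmov_unc(qp, p1)   # 4c: zero p1 when qp is false
    p2 = cmov_unc(qp, p2)   # 4d: zero p2 when qp is false
    return p1, p2


print(unconditional_compare(True, 5, 5))   # (True, False)
print(unconditional_compare(False, 5, 5))  # (False, False)
```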

[00069] The fifth predicated instruction in Table 1 is one example of a predicated "parallel
compare" instruction. Parallel compare instructions perform a logical operation (such as an "or"
or "and" operation) during the same cycle as a compare operation. A parallel compare instruction
may thus be of an "or" or "and" form. For at least one embodiment, additional forms of parallel
compare instructions are also supported. For example, a parallel compare instruction may support
a first logical operation for one predicate target and another logical operation for a second
predicate target. The DeMorgan form is a combination of an "or" type compare for one of the
destination predicate registers and an "and" type compare for the other destination predicate
register. In addition, another form of parallel compare instruction may compare the contents of
one source register with the complement of the contents of a second source register, and vice
versa. Table 1 illustrates the "or" form of a predicated parallel compare instruction.

[00070] Parallel compare instructions allow multiple simultaneous compare operations (of 
the same type) to target a single destination predicate register. Both the qp value and the result 
of the specified compare operation (which may be, for example, cmp.ne, cmp.eq, etc.) determine 
whether the target predicate registers are updated. For an embodiment, such as that indicated in 
Table 2, below, where parallel compare instructions do not update the destination registers under 
certain conditions, it is assumed that the destination registers are properly initialized before the 
parallel compare instruction. For at least one embodiment, parallel compare instructions may 
implement the pseudo-code operations specified in Table 2. 

Table 2

Parallel Compare Type   Operation   First Predicate Destination             Second Predicate Destination
And                     and         if (qp && !result) then {target = 0}    if (qp && !result) then {target = 0}
                        andcm       if (qp && result) then {target = 0}     if (qp && result) then {target = 0}
Or                      or          if (qp && result) then {target = 1}     if (qp && result) then {target = 1}
                        orcm        if (qp && !result) then {target = 1}    if (qp && !result) then {target = 1}
DeMorgan                or.andcm    if (qp && result) then {target = 1}     if (qp && result) then {target = 0}
                        and.orcm    if (qp && !result) then {target = 0}    if (qp && !result) then {target = 1}
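
The pseudo-code operations of Table 2 can be captured in a small lookup table. The following Python sketch is illustrative only; the names `PARALLEL_COMPARE_RULES` and `parallel_compare` are hypothetical, and the model simply transcribes the table rows. Each rule records which compare result triggers a write and which value is written to each destination; in every other case the destinations keep their incoming values.

```python
# Each entry: (compare result that triggers a write,
#              value written to the first destination,
#              value written to the second destination).
# Transcribed from Table 2; the dictionary layout is an illustrative assumption.
PARALLEL_COMPARE_RULES = {
    "and":      (False, 0, 0),
    "andcm":    (True,  0, 0),
    "or":       (True,  1, 1),
    "orcm":     (False, 1, 1),
    "or.andcm": (True,  1, 0),
    "and.orcm": (False, 0, 1),
}


def parallel_compare(op: str, qp: bool, result: bool, p1: int, p2: int) -> tuple[int, int]:
    """Return the new (p1, p2) after a parallel compare of type `op`."""
    trigger, first_value, second_value = PARALLEL_COMPARE_RULES[op]
    if qp and result == trigger:
        return first_value, second_value
    return p1, p2  # destinations are left unchanged


print(parallel_compare("or", True, True, 0, 0))        # (1, 1)
print(parallel_compare("and.orcm", True, False, 1, 0))  # (0, 1)
```
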

[00071] Understanding operation of the parallel compare instructions is useful when
determining the standard micro-ops to be generated during decomposition by the μ-op generator
116 (Figs. 1 and 2). In order to better understand the operation of the sample parallel compare
operations indicated in Table 2, consider the truth tables set forth below in Tables 3-9.

[00072] Table 3 illustrates that, for parallel compare instructions, the values of destination
registers p1 and p2 remain unchanged if the qualifying predicate (qp) for the parallel instruction
is false:



Table 3 - All parallel compare instructions; qp = false

qp for        Rslt of      Incoming value of    Result value of      Incoming value of    Result value of
instruction   compare      1st destination      1st destination      2nd destination      2nd destination
                           predicate register   predicate register   predicate register   predicate register
0             Don't care   x                    x                    y                    y




[00073] Tables 4-9 are truth tables for various illustrative forms of parallel compare
instructions in the case that the qualifying predicate, qp, is true. Table 4, below, is a truth table
for the parallel "and" instruction when the qualifying predicate is true. Table 4 illustrates that
the initial values of the destination predicate registers for a parallel "and" instruction fall through
unmodified if the result of the comparison designated for the instruction is true. For example,
consider a sample parallel cmp.eq.and instruction: cmp.eq.and p1, p2 = r2, r3. Lines 3 and 4
of Table 4 illustrate that, if the contents of r2 and r3 are equal, then the initial values of p1 and
p2 fall through as the result values. In other words, these are "do nothing" cases. However, lines
1 and 2 of Table 4 illustrate that the result values for the destination registers are forced to a zero
(i.e., "false") value, regardless of the incoming value of the destination register, if the indicated
comparison result is false.



Table 4 - Parallel compare "and" instruction; qp = true

      qp for        Rslt of compare   Incoming value of destination   Result value of destination
      instruction   ("rslt")          predicate register (p)          predicate register (rslt and p)
1.    1             0                 0                               0
2.    1             0                 1                               0
3.    1             1                 0                               0
4.    1             1                 1                               1



[00074] Table 5, below, is a truth table for the parallel "or" instruction when the qualifying
predicate is true. Table 5 illustrates that the initial values of the destination predicate registers for
a parallel "or" instruction fall through unmodified if the result of the comparison designated for
the instruction is false. For example, consider a sample parallel cmp.eq.or instruction:
cmp.eq.or p1, p2 = r2, r3. Lines 1 and 2 of Table 5 illustrate that, if the contents of r2 and r3
are not equal, then the initial values of p1 and p2 fall through as the result values. In other
words, these are "do nothing" cases. However, lines 3 and 4 of Table 5 illustrate that the result
values for the destination registers are forced to a true value, regardless of the incoming value of
the destination register, if the indicated comparison result is true. That is, each destination
predicate register p1 and p2 is assigned a "true" value if the contents of source registers r2 and r3
are equal.



Table 5 - Parallel compare "or" instruction; qp = true

      qp for        Rslt of compare   Incoming value of destination   Result value of destination
      instruction   ("rslt")          predicate register (p)          predicate register (rslt or p)
1.    1             0                 0                               0
2.    1             0                 1                               1
3.    1             1                 0                               1
4.    1             1                 1                               1



[00075] Table 6, below, is a truth table for the parallel "andcm" instruction when the
qualifying predicate is true. The "andcm" instruction performs an "and" operation on the
complement of the result of the indicated comparison with the contents of the indicated
destination predicate register. For example, the "andcm" instruction cmp.eq.andcm p1, p2 = r2,
r3 may be considered the functional equivalent of the corresponding "not equal" instruction:
cmp.neq.and p1, p2 = r2, r3.

[00076] Table 6 illustrates that the initial values of the destination predicate registers for a
parallel "andcm" instruction fall through unmodified if the result of the comparison designated
for the instruction is false. For example, consider a sample parallel cmp.eq.andcm instruction:
cmp.eq.andcm p1, p2 = r2, r3. Lines 1 and 2 of Table 6 illustrate that, if the contents of r2 and
r3 are not equal, then the initial values of p1 and p2 fall through as the result values. In other
words, these are "do nothing" cases. However, lines 3 and 4 of Table 6 illustrate that the result
values for the destination registers are forced to a false value, regardless of the incoming value of
the destination register, if the indicated comparison result is true. That is, each destination
register p1 and p2 is assigned a "false" value if the contents of source registers r2 and r3 are
equal.



Table 6 - Parallel compare "andcm" instruction; qp = true

      qp for        Rslt of compare   !rslt   Incoming value of destination   Result value of destination
      instruction   ("rslt")                  predicate register (p)          predicate register (!rslt and p)
1.    1             0                 1       0                               0
2.    1             0                 1       1                               1
3.    1             1                 0       0                               0
4.    1             1                 0       1                               0



[00077] Table 7, below, is a truth table for the parallel "orcm" instruction when the
qualifying predicate is true. The "orcm" instruction performs an "or" operation on the
complement of the result of the indicated comparison with the contents of the indicated
destination predicate register. For example, the sample "orcm" instruction cmp.eq.orcm p1, p2
= r2, r3 may be considered the functional equivalent of the corresponding "not equal"
instruction: cmp.neq.or p1, p2 = r2, r3.

[00078] Table 7 illustrates that the initial values of the destination predicate registers for a
parallel "orcm" instruction fall through unmodified if the result of the comparison designated for
the instruction is true. For example, consider a sample parallel cmp.eq.orcm instruction:
cmp.eq.orcm p1, p2 = r2, r3. Lines 3 and 4 of Table 7 illustrate that, if the contents of r2 and
r3 are equal, then the initial values of p1 and p2 fall through as the result values. In other words,
these are "do nothing" cases. However, lines 1 and 2 of Table 7 illustrate that the result values for
the destination registers are forced to a "true" value, regardless of the incoming value of the
destination register, if the indicated comparison result is false. That is, each of the destination
predicate registers p1 and p2 is assigned a "true" value if the contents of source registers r2 and
r3 are not equal.



Table 7 - Parallel compare "orcm" instruction; qp = true

      qp for        Rslt of compare   !rslt   Incoming value of destination   Result value of destination
      instruction   ("rslt")                  predicate register (p)          predicate register (!rslt or p)
1.    1             0                 1       0                               1
2.    1             0                 1       1                               1
3.    1             1                 0       0                               0
4.    1             1                 0       1                               1



[00079] Table 8, below, is a truth table for the DeMorgan parallel "or.andcm" instruction
when the qualifying predicate is true. The "or.andcm" instruction may force different values for
the destination predicate registers. For the first destination predicate register, the "or.andcm"
parallel compare instruction performs an "or" operation on the result of the indicated comparison
with the contents of the first indicated destination predicate register. For the second destination
predicate register, the "or.andcm" parallel compare instruction performs an "and" operation on
the complement of the result of the indicated comparison with the contents of the second
indicated destination predicate register.

[00080] Table 8 illustrates that the initial values of the first and second destination predicate
registers for an "or.andcm" parallel compare instruction fall through unmodified if the result of
the comparison designated for the instruction is false. For example, consider a sample parallel
cmp.eq.or.andcm instruction: cmp.eq.or.andcm p1, p2 = r2, r3. Lines 1-4 of Table 8
illustrate that, if the contents of r2 and r3 are not equal, then the initial values of p1 and p2 fall
through as the result values. In other words, these are "do nothing" cases. However, lines 5-8 of
Table 8 illustrate that the result value for the first predicate destination register is forced to a
"true" value, regardless of the incoming value of the destination register, if the indicated
comparison result is true. Lines 5-8 of Table 8 also illustrate that the result value for the second
predicate destination register is forced to a "false" value, regardless of the incoming value of the
destination register, if the indicated comparison result is true. That is, if the contents of source
registers r2 and r3 are equal, then destination register p1 is set to a "true" value and
destination register p2 is reset to a "false" value.



Table 8 - Parallel compare "or.andcm" instruction; qp = true

      qp for        Rslt of     !rslt   Incoming value   Incoming value   Result value     Result value
      instruction   compare             of first         of 2nd           of first         of 2nd
                    ("rslt")            destination      destination      destination      destination
                                        predicate        predicate        predicate        predicate
                                        register (p1)    register (p2)    register         register
                                                                          (rslt or p1)     (!rslt and p2)
1.    1             0           1       0                0                0                0
2.    1             0           1       0                1                0                1
3.    1             0           1       1                0                1                0
4.    1             0           1       1                1                1                1
5.    1             1           0       0                0                1                0
6.    1             1           0       0                1                1                0
7.    1             1           0       1                0                1                0
8.    1             1           0       1                1                1                0

[00081] Table 9, below, is a truth table for the DeMorgan "and.orcm" parallel compare
instruction when the qualifying predicate is true. The "and.orcm" instruction may force different
values for the destination predicate registers. For the first destination predicate register, the
"and.orcm" parallel compare instruction performs an "and" operation on the result of the
indicated comparison with the contents of the first indicated destination predicate register. For
the second destination predicate register, the "and.orcm" parallel compare instruction performs
an "or" operation on the complement of the result of the indicated comparison with the contents
of the second indicated destination predicate register.

[00082] Table 9 illustrates that the initial values of the first and second destination predicate
registers for an "and.orcm" parallel compare instruction fall through unmodified if the result of
the comparison designated for the instruction is true. For example, consider a sample parallel
cmp.eq.and.orcm instruction: cmp.eq.and.orcm p1, p2 = r2, r3. Lines 5-8 of Table 9
illustrate that, if the contents of r2 and r3 are equal, then the initial values of p1 and p2 fall
through as the result values. In other words, these are "do nothing" cases. However, lines 1-4 of
Table 9 illustrate that the result value for the first predicate destination register is forced to a
"false" value, regardless of the incoming value of the destination register, if the indicated
comparison result is false. Lines 1-4 of Table 9 also illustrate that the result value for the second
predicate destination register is forced to a "true" value, regardless of the incoming value of the
second destination register, if the indicated comparison result is false. That is, if the contents of
source registers r2 and r3 are not equal, then destination register p1 is set to a "false" value
and destination register p2 is reset to a "true" value.

Table 9 - Parallel compare "and.orcm" instruction; qp = true

      qp for        Rslt of     !rslt   Incoming value   Incoming value   Result value     Result value
      instruction   compare             of first         of 2nd           of first         of 2nd
                    ("rslt")            destination      destination      destination      destination
                                        predicate        predicate        predicate        predicate
                                        register (p1)    register (p2)    register         register
                                                                          (rslt and p1)    (!rslt or p2)
1.    1             0           1       0                0                0                1
2.    1             0           1       0                1                0                1
3.    1             0           1       1                0                0                1
4.    1             0           1       1                1                0                1
5.    1             1           0       0                0                0                0
6.    1             1           0       0                1                0                1
7.    1             1           0       1                0                1                0
8.    1             1           0       1                1                1                1
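
The result columns of Tables 4-9 are labelled with closed-form Boolean expressions (for example, "rslt and p" or "!rslt or p"). As a cross-check, the following Python sketch enumerates every input combination for qp = true and regenerates the rows from those expressions. The names `CLOSED_FORMS`, `truth_table`, and `demorgan_table` are hypothetical and serve only this illustration.

```python
from itertools import product

# Closed-form result expressions taken from the column headings of Tables 4-9.
CLOSED_FORMS = {
    "and":   lambda rslt, p: rslt and p,
    "or":    lambda rslt, p: rslt or p,
    "andcm": lambda rslt, p: (not rslt) and p,
    "orcm":  lambda rslt, p: (not rslt) or p,
}


def truth_table(form: str) -> list[tuple[int, int, int]]:
    """Rows of (rslt, incoming p, result p) for qp = true, in table order."""
    f = CLOSED_FORMS[form]
    return [(r, p, int(f(bool(r), bool(p)))) for r, p in product((0, 1), repeat=2)]


def demorgan_table(first: str, second: str) -> list[tuple[int, int, int, int, int]]:
    """Rows of (rslt, p1, p2, result p1, result p2) for a DeMorgan form."""
    f1, f2 = CLOSED_FORMS[first], CLOSED_FORMS[second]
    return [
        (r, p1, p2, int(f1(bool(r), bool(p1))), int(f2(bool(r), bool(p2))))
        for r, p1, p2 in product((0, 1), repeat=3)
    ]


# Regenerate the rows of Table 9 ("and.orcm"): first destination uses "and",
# second destination uses "orcm".
for row in demorgan_table("and", "orcm"):
    print(row)
```

Calling `demorgan_table("or", "andcm")` regenerates the rows of Table 8 in the same way.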



[00083] The "or" form of the parallel compare instruction set forth at instruction 5 of Table
1 is more complex than the unconditional compare set forth at instruction 4 of Table 1. That is,
six micro-ops 5a-5f, instead of four, are generated for the predicated parallel compare
instruction. While the following discussion focuses, for purposes of illustration, on the sample
"or" form of a parallel compare instruction set forth at instruction 5 of Table 1, one skilled in the
art will understand that the embodiments discussed herein of a method of decomposing an
instruction into micro-ops are intended to encompass all forms of parallel compare instructions.

[00084] For at least one embodiment, the six micro-ops 5a-5f are generated by the μ-op
generator 116 (Fig. 1). Each of the six decomposed micro-ops 5a-5f indicates a single
destination register and two source operands. The decomposed micro-ops 5a-5f are intended to
simplify the execution hardware involved in executing instruction 5.

[00085] Brief reference to Figs. 8 and 9 illustrates that, as is shown in Table 1, in order to
simplify the hardware that executes a predicated parallel compare, a sequence of six
micro-operations 802, 804, 806, 808, 810, 812 may be generated for instruction 5 (800).

[00086] For at least one embodiment, a hardware structure 902 implements the truth table
set forth in Table 5. Fig. 9 illustrates that the hardware 902 for execution of the original
predicated parallel compare instruction 800 is relatively complex. It receives four inputs: qp, r2, 
r3 and the p value. One of skill in the art will understand that each p value pi, p2 is run through 
the hardware 902. For at least one embodiment, each p value is run concurrently through one of 
multiple hardware structures 902 along the lines illustrated in Fig. 9. One of skill in the art will 
recognize that alternative structure arrangements may be employed in order to achieve the 
functionality illustrated in Table 5. All such alternative structures are within the scope of the 
current disclosure. 

[00087] Fig. 8 illustrates that the following sequence of micro-operations may be produced by
the μ-op generator 116 (Figs. 1, 2) for sample instruction 5 set forth above in Table 1. That is,
predicated instruction 5 may be decomposed by μ-op generator 116 into the following multiple
micro-ops, where each micro-op has only two sources and only one destination:

802. append p-temp1 = p1, qp
804. cmp.eq p1 = r2, r3
806. cmov.or p1 = p-temp1, p1
808. append p-temp2 = p2, qp
810. cmp.eq p2 = r2, r3
812. cmov.or p2 = p-temp2, p2

[00088] Referring also to Table 1, it is seen that "standard" non-predicated compare micro-ops
5c (804) and 5d (810) are generated for the basic functional operation indicated by the
parallel compare instruction (see Table 5). Table 1 indicates that, for the sample "or" parallel
compare instruction set forth at instruction 5, one standard compare micro-op is generated for
each of the destination predicate registers indicated by the instruction (800).

[00089] One standard compare micro-op 5c (804) is generated to set p1 to a "true" value if
the values in r2 and r3 are equal. If the two values are not equal, execution of standard micro-op
5c (804) sets the value of p1 to "false."

[00090] A second standard compare micro-op 5d (810) is also generated to set p2 to a "true"
value if the values in r2 and r3 are equal. If the two values are not equal, execution of micro-op
5d (810) sets the value of p2 to "false."

[00091] In addition, because instruction 5 is predicated, Table 1 and Fig. 8 indicate that
predication-support micro-ops 5a (802), 5b (808), 5e (806) and 5f (812) are generated. One
predication-support append instruction and one predication-support cmov instruction are thus
generated for each predicated destination operand, p1 and p2.

[00092] Micro-op 5a (802), the append operation for p1, is a predication-support micro-op
that copies the "first" value of p1, plus the predicate value in qp, into a non-architected internal
temporary register, "p-temp1".

[00093] Micro-op 5b (808), the append operation for p2, is also a predication-support
micro-op. Micro-op 5b copies the "first" value of p2, plus the predicate value in qp, into a
non-architected internal temporary register, "p-temp2".

[00094] Note that this approach of appending the qp to a predicate register value results, for
at least one embodiment, in the addition of an extra bit to the data value associated with the
predicate register. For at least one embodiment, the predicate register file thus accommodates 1
bit of data for the predicate register and also an additional 2nd bit for the qualifying predicate.
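
One way to picture the extra bit is as a two-bit register-file entry. The Python sketch below is purely illustrative: the bit positions (data in bit 0, qualifying predicate in bit 1) and the names `pack` and `unpack` are assumptions, since the text does not specify an encoding.

```python
VALUE_BIT = 0b01  # assumed position of the predicate data bit
QP_BIT = 0b10     # assumed position of the appended qualifying-predicate bit


def pack(value: bool, qp: bool) -> int:
    """Combine a predicate value and a qp bit into one two-bit entry."""
    return (int(qp) << 1) | int(value)


def unpack(entry: int) -> tuple[bool, bool]:
    """Split a two-bit entry back into (predicate value, qp)."""
    return bool(entry & VALUE_BIT), bool(entry & QP_BIT)


entry = pack(value=True, qp=False)
print(bin(entry))     # 0b1
print(unpack(entry))  # (True, False)
```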

[00095] Micro-ops 5e and 5f, the conditional move operations 806 and 812, respectively, are
predication-support micro-ops that effect a selection. Figs. 8 and 9 illustrate that inputs for the
cmov.or micro-op are the p-temp1 value generated as a result of the append micro-op 5a (802)
and the p1 value generated as a result of the compare micro-op 5c (804). Accordingly, hardware
structure 904 to execute the cmov micro-ops 5e (806) and 5f (812) is less complicated than
structure 902.

[00096] The operation of the cmov.or micro-op is as follows. If the qp value appended and
stored as an extra bit in the temporary predicate register (p-temp1 and p-temp2) is true, then the
qualifying predicate for instruction 5 was true. If not, then the selection line for the mux 910 is
false, and the "first" value of the predicate register falls through from the temporary
predicate register as the destination value.

[00097] However, if the qp value in the temporary predicate register is true, then the value
on the selection line of the mux depends on the "second" value of the predicate register. This
"second" value reflects the result of the compare operation performed as a result of the
corresponding compare micro-op: 5c (804) for p1 and 5d (810) for p2.

[00098] Table 5, along with Fig. 9, reflects the operation of the mux 910 when the qp value
in the temporary predicate register indicates a "true" value. Lines 1 and 2 indicate that, if the
result of the compare operation, as reflected in the "second" value of the predicate register, is
false, then the selection line goes low and the "first" value of the predicate register falls through
from the temporary predicate register as the destination value.

[00099] In contrast, lines 3 and 4 of Table 5, along with Fig. 9, reflect that if the result of
the compare operation, as reflected in the "second" value of the predicate register, is true, then
the selection line goes high. In such case, a constant value of "true" (such as, for instance, a
value of 1b'1) is selected as the destination value.
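
The six-micro-op sequence 802-812 and the cmov.or selection can be modeled end to end. The Python sketch below is a behavioral illustration under stated assumptions, not the circuit of Fig. 9: the function names are hypothetical, the temporary register is represented as a (value, qp) tuple, and the mux selection line is modeled as the logical "and" of the appended qp bit and the compare result.

```python
def append(first_value: bool, qp: bool) -> tuple[bool, bool]:
    """Model of the append micro-op: save the "first" value with the qp bit."""
    return (first_value, qp)


def cmov_or(temp: tuple[bool, bool], second_value: bool) -> bool:
    """Model of cmov.or: select constant true, or let the "first" value fall through."""
    first_value, qp = temp
    select = qp and second_value  # assumed form of the mux selection line
    return True if select else first_value


def parallel_or_compare(qp: bool, r2: int, r3: int, p1: bool, p2: bool) -> tuple[bool, bool]:
    """Model of micro-ops 802-812 for the "or" form of the parallel compare."""
    p_temp1 = append(p1, qp)    # 802 (5a)
    p1 = (r2 == r3)             # 804 (5c)
    p1 = cmov_or(p_temp1, p1)   # 806 (5e)
    p_temp2 = append(p2, qp)    # 808 (5b)
    p2 = (r2 == r3)             # 810 (5d)
    p2 = cmov_or(p_temp2, p2)   # 812 (5f)
    return p1, p2


print(parallel_or_compare(True, 7, 7, False, False))  # (True, True)
print(parallel_or_compare(False, 7, 7, False, True))  # (False, True)
```

Each `cmov_or` call reads only two operands, which corresponds to the two-input selection described for the decomposed micro-ops.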

[000100] Fig. 9 thus illustrates that, prior to decomposition, selection circuit 902 to execute
instruction 5 (800, Fig. 8) is relatively complex, requiring four inputs. In contrast, selection
circuit 904 to execute the cmov.or micro-ops (806 and 812, Fig. 8) is simpler, receiving only two
inputs.

[000101] One of skill in the art will recognize, as is stated above, that similar decomposition
may be performed for any other parallel compare instructions. The "or" parallel compare
instruction illustrated in Figs. 8 and 9 is provided for illustrative purposes and should not be
taken to be limiting.

[000102] For example, the illustrative selection circuit 1040 illustrated in Fig. 10 may be
utilized to execute the cmov.or.andcm micro-ops 10a-10f (1002-1012) set forth in Table 10,
below. Table 10 illustrates decomposition of an "or.andcm" form of DeMorgan parallel
compare instruction into micro-ops 10a-10f. Operation of selection circuit 1060, which includes
muxes 1050 and 1055, implements the functionality indicated by the truth table set forth in
Table 8.

Table 10 - DeMorgan Instruction

Instruction                               Micro-op sequences
(qp) cmp.eq.or.andcm p1, p2 = r2, r3      10a   append p-temp1 = p1, qp
                                          10b   append p-temp2 = p2, qp
                                          10c   cmp.eq p1 = r2, r3
                                          10d   cmp.eq p2 = r2, r3
                                          10e   cmov.or.andcm p1 = p-temp1, p1
                                          10f   cmov.or.andcm p2 = p-temp2, p2
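
The Table 10 sequence can be modeled in the same style. The Python sketch below is an assumption-laden illustration, not the circuit of Fig. 10: in particular, the `is_first_destination` flag is a hypothetical device for distinguishing the "or" behavior applied to p1 from the "andcm" behavior applied to p2, because the micro-op listing itself does not show how that distinction is encoded.

```python
def append(first_value: bool, qp: bool) -> tuple[bool, bool]:
    """Model of the append micro-op: save the "first" value with the qp bit."""
    return (first_value, qp)


def cmov_or_andcm(temp: tuple[bool, bool], compare_result: bool,
                  is_first_destination: bool) -> bool:
    """Model of cmov.or.andcm.

    When qp and the compare result are both true, the first destination is
    forced to true ("or") and the second to false ("andcm"). Otherwise the
    saved "first" value falls through. The flag argument is hypothetical.
    """
    first_value, qp = temp
    if not (qp and compare_result):
        return first_value
    return is_first_destination


def or_andcm_compare(qp: bool, r2: int, r3: int, p1: bool, p2: bool) -> tuple[bool, bool]:
    """Model of micro-ops 10a-10f in the order listed in Table 10."""
    p_temp1 = append(p1, qp)                   # 10a
    p_temp2 = append(p2, qp)                   # 10b
    p1 = (r2 == r3)                            # 10c
    p2 = (r2 == r3)                            # 10d
    p1 = cmov_or_andcm(p_temp1, p1, True)      # 10e
    p2 = cmov_or_andcm(p_temp2, p2, False)     # 10f
    return p1, p2


print(or_andcm_compare(True, 4, 4, False, True))  # (True, False)
```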



[000103] Similarly, the illustrative selection circuit 1140 illustrated in Fig. 11 may be utilized
to execute the cmov.and.orcm micro-ops 11a-11f (1102-1112) set forth in Table 11, below.
Table 11 illustrates decomposition of an "and.orcm" form of DeMorgan parallel compare
instruction into micro-ops 11a-11f. Operation of selection circuit 1160, which includes muxes
1150 and 1155, implements the functionality indicated by the truth table set forth in Table 9.



Table 11 - DeMorgan Instruction

Instruction                               Micro-op sequences
(qp) cmp.eq.and.orcm p1, p2 = r2, r3      11a   append p-temp1 = p1, qp
                                          11b   append p-temp2 = p2, qp
                                          11c   cmp.eq p1 = r2, r3
                                          11d   cmp.eq p2 = r2, r3
                                          11e   cmov.and.orcm p1 = p-temp1, p1
                                          11f   cmov.and.orcm p2 = p-temp2, p2



[000104] Similarly, one of skill in the art will also understand that selection circuits may also 
be built to perform the functionality shown in the truth tables set forth in Tables 4, 6 and 7, 
above. 

[000105] The approaches, discussed above, of decomposing various multiple-destination
instructions into a series of micro-ops may be generalized. Fig. 6 is a flowchart
illustrating at least one embodiment of a method 600 for decomposing a predicated multiple-
destination instruction into a series of single-destination micro-ops. For at least one
embodiment, a μ-op generator 116 (Fig. 1) performs blocks 606-612 of the method 600 during a
micro-op generation stage 310 of an execution pipeline 300 (Fig. 3). A fetch unit (not shown)
may perform block 604 during a fetch stage 304 of an execution pipeline 300 (Fig. 3).

[000106] Fig. 6 illustrates that processing for the method 600 begins at block 602 and
proceeds to block 604. At block 604 the current predicated instruction is fetched from an
instruction cache such as instruction cache 160 illustrated in Figs. 1 and 2. Responsive to a miss
in the instruction cache, the instruction may be fetched from an instruction space, such as
instruction space 140 illustrated in Figs. 1 and 2, in memory (such as, e.g., 150 in Figs. 1 and 2).

[000107] The instruction fetched at block 604 may be either a single-destination instruction or
a multiple-destination instruction. Accordingly, for at least one embodiment, the instruction
fetched at block 604 is a multiple-destination instruction. Such multiple-destination instruction
may be of those forms set forth at instructions 2 through 5 of Table 1. One skilled in the art will
recognize, however, that instruction architectures may also support other multiple-destination
instructions. The techniques disclosed herein may be employed on any instruction, including
arithmetic and memory instructions, that indicates one or more destination registers. Supported
multiple-destination instructions include general register compare instructions (such as those set
forth in Tables 1, 10 and 11), floating point compare instructions, test bit instructions, test NaT
bit instructions, floating-point class instructions, load base-increment instructions (such as
instruction 2 set forth in Table 1) for general and floating point registers, and store
base-increment instructions for general and floating point registers. Such listing is in no way
intended to be an exhaustive list.

[000108] Fig. 6 illustrates that processing proceeds from block 604 to block 606.
However, one of skill in the art will recognize that one or more operations (not shown) may be
performed on the fetched instruction before performing block 606. For example, the instruction
may be processed through a decode phase (see 306, Fig. 3) and/or an architectural rename phase
(see 308, Fig. 3) of an execution pipeline before entering a µ-op generation stage (see 310,
Fig. 3). For alternative embodiments, other operations may also be performed on the fetched
instruction before the remaining blocks of Fig. 6 are performed.

[000109] As is stated above, a µ-op generator 116 (Fig. 1) may perform blocks 606-612
during the µ-op generation stage (310, Fig. 3). At block 606 standard µ-ops are generated for the
current instruction. As used herein, the term "standard" µ-ops is intended to encompass those
micro-ops that are generated for an instruction, regardless of whether the instruction is
predicated. That is, even a non-predicated instruction may be decomposed into a plurality of
µ-ops in order to meet the goal that each micro-op indicates no more than two sources and one
destination.

[000110] For example, micro-ops 2c and 2d of Table 1 illustrate "standard" micro-ops for a
register base update form of instruction (instruction 2 in Table 1). Even if instruction 2 were not
predicated, the original instruction nonetheless indicates two registers to be updated: the
destination register, r1, and the base register, r3. Accordingly, two "standard" micro-ops, 2c and
2d, are generated such that a separate one-destination micro-op is generated for each of the two
destination register updates. For illustrative purposes, it can be assumed that the two standard
micro-ops 2c and 2d are generated at block 606. Processing then proceeds to block 608.
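
The split of a base-update load into two single-destination operations can be sketched as follows. This is a minimal illustration only, assuming that one standard micro-op performs the load into r1 and the other adds an increment to the base register r3. The function name, the dictionary register file, and the increment value are hypothetical, and the sketch does not say which of 2c and 2d performs which update.

```python
def load_base_increment(regs, mem, dest, base, increment):
    """Model a base-update load as two single-destination steps (assumed semantics)."""
    address = regs[base]               # both steps read the original base value
    regs[dest] = mem[address]          # first standard micro-op: load into the destination
    regs[base] = address + increment   # second standard micro-op: update the base register
    return regs


# Hypothetical register and memory contents, for illustration only.
registers = {"r1": 0, "r3": 100}
memory = {100: 42}
load_base_increment(registers, memory, "r1", "r3", 8)
```

Each step writes exactly one register, which is the property the "standard" micro-ops are intended to provide.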

[000111] At block 608 it is determined whether the current instruction is predicated. If not,
then the standard micro-ops generated at block 606 are the only micro-ops to be generated for the
current instruction. Accordingly, if the predication check at block 608 evaluates to "false," then
processing ends at block 614.

[000112] If, however, the predication check at block 608 evaluates to "true," then processing
proceeds to block 612. At block 612, additional predication-support micro-ops are generated so
that each micro-op for the predicated instruction indicates only two inputs. For most
instructions, the additional micro-ops generated at block 612 include conditional move
instructions and may also include an append instruction for each destination register. After
additional micro-ops are generated at block 612, processing ends at block 614.
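
The control flow of blocks 606 through 614 can be sketched as follows. This is a hedged illustration rather than the disclosed implementation: the Instruction class, its field names, and the textual micro-op format are assumptions made for the example, and block 612 is reduced to a placeholder that emits one conditional move per destination register.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instruction:
    # Hypothetical model of a fetched instruction; field names are illustrative.
    opcode: str
    dests: List[str]
    sources: List[str]
    qp: Optional[str] = None  # qualifying predicate register, None if not predicated


def standard_uops(insn: Instruction) -> List[str]:
    # Block 606: one single-destination micro-op per destination register.
    srcs = ", ".join(insn.sources)
    return [f"{insn.opcode} {dest} = {srcs}" for dest in insn.dests]


def predication_support_uops(insn: Instruction) -> List[str]:
    # Block 612 (placeholder): one conditional move per destination register.
    return [f"cmov {dest} = {insn.qp}, {dest}" for dest in insn.dests]


def method_600(insn: Instruction) -> List[str]:
    uops = standard_uops(insn)                  # block 606
    if insn.qp is not None:                     # block 608: is the instruction predicated?
        uops += predication_support_uops(insn)  # block 612
    return uops                                 # block 614: done


predicated = Instruction("cmp.eq", ["p1", "p2"], ["r2", "r3"], qp="p5")
print(method_600(predicated))
```

A non-predicated instruction yields only the standard micro-ops, while a predicated one also receives the predication-support micro-ops.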

[000113] Fig. 7 illustrates in further detail at least one embodiment of a method for generating
additional predication-support micro-ops at block 612. Fig. 7 illustrates that processing for the
method 612 begins at block 702 and proceeds to block 704.

[000114] At block 704, it is determined whether the "first" value of the current destination
register is a predefined constant. Such determination would evaluate to "true," for instance, if
the current instruction was an unconditional compare instruction such as that illustrated at
instruction 4 of Table 1. For such an instruction, the value to be loaded into the destination
register upon a certain condition is a known constant (in this case, the known constant is a zero
value). Accordingly, a prior, or "first," value of the destination need not be preserved via an
append instruction. Fig. 7 illustrates that, in such case, processing proceeds to block 706.
Otherwise, if the check at block 704 evaluates to "false," processing proceeds to block 708.

[000115] At block 706, a predication-support micro-op, a conditional move micro-op, is
generated in order to conditionally move one of two values into the destination register. The
conditional move micro-op generated at block 706 is to move either the known constant value or
a computed value into the current destination register for the current instruction. In most cases,
the computed value is a value computed by the "standard" micro-ops generated at block 606
(Fig. 6) for the current instruction.

[000116] For illustrative purposes, the operation of blocks 704 and 706 is considered in light
of sample instruction 4 set forth in Table 1. (It should be noted that, during execution of the
method 600 illustrated in Fig. 6, standard micro-ops 4a and 4b are generated for instruction 4 at
block 606). The first destination register, p1, is considered at block 704. At block 704 it is
determined that a predefined constant value (zero) will be moved into the destination register p1
if the qualifying predicate for the current instruction is not true. Accordingly, block 704
evaluates to "true," and processing then proceeds to block 706.

[000117] At block 706, a conditional move micro-op is generated. Continuing with the
example of instruction 4 of Table 1, conditional move micro-op 4c is generated at block 706.
Micro-op 4c indicates that, if the qualifying predicate for instruction 4 is true, the value of p1
computed by micro-op 4a is to be moved to destination register p1. Micro-op 4c also indicates
that, if the qualifying predicate for instruction 4 is false, a zero value is to be moved to register
p1.






[000118] Processing then proceeds from block 706 to block 712. At block 712, it is
determined whether additional destination registers are indicated by the current instruction.
Continuing with the example of instruction 4 in Table 1, it would be determined on the first pass
of block 712 that an additional destination register, p2, should be considered. Accordingly,
processing proceeds back to block 704. If, however, all destination registers for the current
instruction have been considered, then processing ends at block 714.

[000119] At the second pass through block 704, for our sample instruction, it is again
determined that a predefined constant value (i.e., zero) is to be moved to the destination register
(p2) if the qualifying predicate for the current instruction is false. Accordingly, processing
proceeds to block 706.

[000120] At block 706, a conditional move micro-op is generated for the second destination
register, p2. Continuing with the example of instruction 4 of Table 1, conditional move
micro-op 4d is generated at block 706. Micro-op 4d indicates that, if the qualifying predicate for
instruction 4 is true, the value of p2 computed by micro-op 4b is to be moved to destination
register p2. Micro-op 4d also indicates that, if the qualifying predicate for instruction 4 is false,
a zero value is to be moved to register p2.
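
The selection behavior described for micro-ops 4c and 4d can be sketched as follows. This is a minimal model under stated assumptions: the helper names are hypothetical, and the compare semantics assumed for micro-ops 4a and 4b (p1 receives the equality result and p2 its complement) are an assumption for illustration, since Table 1 defines the actual operations.

```python
def cmov_const(qp: bool, computed: bool, constant: bool = False) -> bool:
    # Select the computed value when qp is true, otherwise the known constant (zero).
    return computed if qp else constant


def unconditional_compare(qp: bool, r2: int, r3: int):
    p1_computed = (r2 == r3)          # micro-op 4a (assumed compare semantics)
    p2_computed = not (r2 == r3)      # micro-op 4b (assumed complement)
    p1 = cmov_const(qp, p1_computed)  # micro-op 4c
    p2 = cmov_const(qp, p2_computed)  # micro-op 4d
    return p1, p2


print(unconditional_compare(qp=False, r2=7, r3=7))
```

Because the value selected when the qualifying predicate is false is a fixed constant, each conditional move needs only the predicate and the computed value as sources, and no append micro-op is required.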

[000121] Processing then proceeds from block 706 to block 712. At the second pass through
block 712 it is determined, for our example, that no further destination registers need be
considered for the current instruction. Processing then ends at block 714. As is stated above, it
is determined at block 712 whether additional destination registers are indicated by the current
instruction. For the example of instruction 3 in Table 1, it would be determined on the first
pass of block 712 that an additional destination register, p2, should be considered. Accordingly,
processing proceeds back to block 704. If, however, all destination registers for the current
instruction have been considered, then processing ends at block 714.

[000122] For illustrative purposes, Fig. 7 is further considered in light of sample instruction 3
set forth in Table 1. (It should be noted that, during execution of the method 600 illustrated in
Fig. 6, standard micro-ops 3c and 3d are generated for instruction 3 at block 606). The first
destination register, p1, for the instruction is considered at block 704. At block 704 it is
determined that instruction 3, a cmp.eq instruction, is not a form of instruction for which a
predefined constant value will be moved into the destination register p1 if the qualifying
predicate for the current instruction is not true. Accordingly, block 704 evaluates to "false," and
processing then proceeds to block 708.

[000123] Because the "first" value of p1 will be selected as the output value for p1 if the
qualifying predicate for instruction 3 is not true, this value should be saved in a temporary
register. Accordingly, at block 708 a predication-support append instruction is generated to save
this information. For our example, micro-op 3a is generated for instruction 3 at the first pass of
block 708. The operation of the append instruction 3a not only saves the "first" value of p1, but
it appends the qp value onto the register as well, as is described above. Processing then proceeds
to block 710.

[000124] At block 710, a second predication-support micro-op is generated for the current
destination register, p1. Continuing with the example of instruction 3 from Table 1, conditional
move micro-op 3e is generated at block 710. Micro-op 3e indicates that, if the qualifying
predicate for instruction 3 is true, the value of p1 computed by standard micro-op 3c is to be
moved to destination register p1. Micro-op 3e also indicates that, if the qualifying predicate for
instruction 3 is false, the "first" value of p1 from the temporary register is to be moved to
destination register p1. Processing then proceeds to block 712, where it is determined that a
second destination register, p2, should be considered. Processing thus proceeds back to block
704.
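
The interaction of the append and conditional move micro-ops for destination register p1 can be sketched as follows. This is an illustrative model only: the Temp tuple, the function names, and the assumption that standard micro-op 3c writes the equality of r2 and r3 are hypothetical, and the actual temporary register layout is as described elsewhere in this disclosure.

```python
from typing import NamedTuple


class Temp(NamedTuple):
    # Hypothetical model of a temporary register written by an append micro-op.
    first: bool  # prior ("first") value of the destination register
    qp: bool     # qualifying predicate value appended to it


def append(first: bool, qp: bool) -> Temp:
    # Save the prior destination value together with the predicate.
    return Temp(first, qp)


def cmov(temp: Temp, computed: bool) -> bool:
    # Two sources only: the temporary (prior value plus qp) and the computed value.
    return computed if temp.qp else temp.first


def predicated_cmp_eq_p1(p1_first: bool, qp: bool, r2: int, r3: int) -> bool:
    p_temp1 = append(p1_first, qp)     # micro-op 3a
    p1_computed = (r2 == r3)           # standard micro-op 3c (assumed semantics)
    return cmov(p_temp1, p1_computed)  # micro-op 3e


print(predicated_cmp_eq_p1(p1_first=True, qp=False, r2=1, r3=2))
```

Packing the prior value and the predicate into one temporary is what lets the conditional move choose among three pieces of information while indicating only two source operands.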

[000125] At the second pass through block 704, for our sample instruction, it is again
determined that instruction 3 is not a form of instruction for which a predefined constant value
will be moved into the destination register, here p2, if the qualifying predicate for the current
instruction is not true. Accordingly, block 704 again evaluates to "false," and processing then
proceeds to block 708.

[000126] At the second pass of block 708, a predication-support append instruction is
generated to save the "first" value of p2. For our example, micro-op 3b is generated for
instruction 3 at the second pass of block 708. The operation of the append instruction 3b
appends the qp value onto the "first" value of p2 and saves them both in a temporary register.
Processing then proceeds to block 710.

[000127] At block 710, a second predication-support micro-op is generated for the current
destination register, p2. Continuing with the example of instruction 3 from Table 1, conditional
move micro-op 3f is generated at block 710. Micro-op 3f indicates that, if the qualifying
predicate for instruction 3 is true, the value of p2 computed by standard micro-op 3d is to be
moved to destination register p2. Micro-op 3f also indicates that, if the qualifying predicate for
instruction 3 is false, the "first" value of p2 from the temporary register is to be moved to
destination register p2. Processing then proceeds to block 712.

[000128] At the second pass of block 712, for our sample instruction, it is determined that no
further destination registers need be considered. Processing thus ends at block 714.
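
The per-destination loop of Fig. 7 can be sketched as follows. This is a hedged illustration, not the disclosed implementation: the function name, the boolean flag standing in for the block 704 determination, and the operand format of the constant-case conditional move are assumptions. The append and conditional move formats follow the pattern shown in Table 11.

```python
from typing import List


def generate_support_uops(dests: List[str], qp: str,
                          first_value_is_constant: bool) -> List[str]:
    uops = []
    for index, dest in enumerate(dests, start=1):  # block 712: one pass per destination
        if first_value_is_constant:                # block 704
            # Block 706: the constant is implied, so no append is needed.
            uops.append(f"cmov {dest} = {qp}, {dest}")
        else:
            temp = f"p-temp{index}"
            uops.append(f"append {temp} = {dest}, {qp}")  # block 708
            uops.append(f"cmov {dest} = {temp}, {dest}")  # block 710
    return uops                                    # block 714


# Unconditional compare style (like instruction 4): conditional moves only.
print(generate_support_uops(["p1", "p2"], "qp", first_value_is_constant=True))
# Normal compare style (like instruction 3): append plus conditional move.
print(generate_support_uops(["p1", "p2"], "qp", first_value_is_constant=False))
```

The sketch models only which predication-support micro-ops are generated for each destination register. It does not model where they are placed relative to the standard micro-ops generated at block 606.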

[000129] The foregoing discussion discloses selected embodiments of an apparatus, system
and method for implementing single- and multiple-destination instructions, including predicated
arithmetic and memory instructions, using single-destination micro-operations. The methods
described herein may be performed on a processing system such as the processing systems 100,
100a illustrated in Figs. 1 and 2.

[000130] Figs. 1 and 2 illustrate embodiments of processing systems 100, 100a, respectively,
that may utilize disclosed techniques. Systems 100, 100a may be used, for example, to execute
one or more methods implementing predicated instructions using micro-operations, such as the
embodiments described herein. For purposes of this disclosure, a processing system includes any
processing system that has a processor, such as, for example, a digital signal processor (DSP), a
microcontroller, an application specific integrated circuit (ASIC), or a microprocessor. Systems
100 and 100a are representative of processing systems based on the Itanium® and Itanium® 2
microprocessors as well as the Pentium®, Pentium® Pro, Pentium® II, Pentium® III, and
Pentium® 4 microprocessors, all of which are available from Intel Corporation. Other systems
(including personal computers (PCs) having other microprocessors, engineering workstations,
personal digital assistants and other hand-held devices, set-top boxes and the like) may also be
used. At least one embodiment of system 100 may execute a version of the Windows™
operating system available from Microsoft Corporation, although other operating systems and
graphical user interfaces, for example, may also be used.

[000131] Processing systems 100 and 100a include a memory system 150 and a processor
101, 101a. Memory system 150 may store instructions 140 and data 141 for controlling the
operation of the processor 101. Memory system 150 is intended as a generalized representation
of memory and may include a variety of forms of memory, such as a hard drive, CD-ROM,
random access memory (RAM), dynamic random access memory (DRAM), static random access
memory (SRAM), flash memory and related circuitry. Memory system 150 may store
instructions 140 and/or data 141 represented by data signals that may be executed by the
processor 101.

[000132] In the preceding description, various aspects of a method, apparatus and system for
implementing predicated instructions using micro-operations are disclosed. For purposes of
explanation, specific numbers, examples, systems and configurations were set forth in order to
provide a more thorough understanding. However, it is apparent to one skilled in the art that the
described method and apparatus may be practiced without the specific details. It will be obvious
to those skilled in the art that changes and modifications can be made without departing from the
present invention in its broader aspects. While particular embodiments of the present invention
have been shown and described, the appended claims are to encompass within their scope all
such changes and modifications that fall within the true scope of the present invention.